Studies on the intelligibility of time-compressed speech have shown flawless performance
for moderate compression factors, a sharp deterioration for compression factors above
three, and improved performance as a result of "repackaging": a process of dividing
the time-compressed waveform into fragments, called packets, and delivering the packets
at a prescribed rate. This intricate pattern of performance reflects the reliability of
the auditory system in processing speech streams with different information transfer 
rates; the knee-point of performance defines the auditory channel capacity. This study 
is concerned with the cortical computation principle that determines channel capacity. 
Oscillation-based models of speech perception hypothesize that the speech decoding 
process is guided by a cascade of oscillations with theta as "master," capable of tracking 
the input rhythm, with the theta cycles aligned with the intervocalic speech fragments 
termed θ-syllables; intelligibility remains high as long as theta is in sync with the input,
and it sharply deteriorates once theta is out of sync. In the study described here the 
hypothesized role of theta was examined by measuring the auditory channel capacity 
of time-compressed speech that has undergone repackaging. For all speech speeds tested (with
compression factors of up to eight), the packaging rate at capacity equals 9 packets/s, aligned
with the upper limit of cortical theta, θ_max (about 9 Hz), and the packet duration equals
the duration of one uncompressed θ-syllable divided by the compression factor. The
alignment of both the packaging rate and the packet duration with properties of cortical 
theta suggests that the auditory channel capacity is determined by theta. Irrespective of 
speech speed, the maximum information transfer rate through the auditory channel is the 
information in one uncompressed θ-syllable long speech fragment per one θ_max cycle.
Equivalently, the auditory channel capacity is 9 θ-syllables/s.



Keywords: information transfer rate, auditory channel capacity, fast speech, phonetic variability, intelligibility, 
brain rhythms, theta oscillations 



1. INTRODUCTION 

How human brain circuitry enables our communication capa- 
bilities constitutes a compelling scientific challenge. We possess 
only a rudimentary understanding of neuronal computation, 
and there are only a few hypotheses that link brain mechanisms
with elementary cognitive computations that underlie process- 
ing sensory input. In the broader context, the study reported 
here aims at unveiling cortical computational principles that 
govern recognition, using the speech communication mode as a 
vehicle. 

In comprehending spoken language, the listener faces the task 
of decoding a linguistic message embedded in the acoustic wave- 
form. Since words pronounced by the same speaker — and even 
more so words pronounced by different speakers — markedly dif- 
fer in their acoustic realization, the listener is faced with the task of 
mapping a variant stimulus onto an invariant response. The ease
with which we can comprehend speech irrespective of inter-speaker
variability — in gender, age, accent, speed, duration — is therefore 
remarkable. The cortical computational principles that enable 
such capability are yet to be understood. 



A particular phonetic variability of interest is speech speed. 
Studies on the effects of time compression of speech on intelligi- 
bility (e.g., Garvey, 1953; Foulke and Sticht, 1969; Dupoux and 
Green, 1997; Reed and Durlach, 1998; Versfeld and Dreschler, 
2002; Peelle and Wingfield, 2005) have shown flawless performance
for moderate compression ratios, but a sharp deterioration
in intelligibility for compression ratios above about three (with
word error rates greater than 50%). What is the neuronal mechanism
that governs insensitivity to time compression by factors of up
to three? And why does our tolerance to time-scale variability
break down when the compression factor is greater than
three?

Considering speech as an inherently rhythmic phenomenon,
in which linguistic information is pseudo-rhythmically
transmitted in syllabic packets 1 , Ghitza and Greenberg (2009)
questioned whether intelligibility is influenced by neuronal
oscillations. They measured the intelligibility of time-compressed
speech subjected to "repackaging," a process of dividing
time-compressed speech into fragments, called packets, and
delivering the packets at a prescribed rate. As expected, the
intelligibility of speech time-compressed by a factor of three (i.e.,
a high syllabic rate) was poor. Surprisingly, intelligibility was
substantially restored when the information stream was repackaged
by inserting gaps in between successive compressed-signal
intervals.

1 These packets are temporally structured so that most of the energy
fluctuations occur in the range between 3 and 10 Hz (e.g., Greenberg, 1999;
Greenberg and Arai, 2004).

Conventional models of speech perception assume a strict 
decoding of the acoustic signal by linking time-frequency features 
of sensory input with stored time-frequency memory patterns. 
The intricate pattern of human performance as a function of 
speech speed and repackaging (i.e., the insensitivity to moder- 
ate time scale variations; the deterioration in intelligibility for 
compression factors beyond three; and the U-shaped recovery 
of intelligibility by repackaging) is difficult to explain by these 
models, but it can be accounted for by Tempo (Ghitza, 2011), 
a phenomenological model which epitomizes recently proposed 
oscillation-based models of speech perception (e.g., Poeppel, 
2003; Ahissar and Ahissar, 2005; Lakatos et al., 2005; Ding and
Simon, 2009; Ghitza and Greenberg, 2009; Giraud and Poeppel, 
2012; Peelle and Davis, 2012). Tempo hypothesizes that the speech 
decoding process is performed within a time-varying, hierarchi- 
cal window structure synchronized with the input. The window 
structure is generated by a cascade of oscillations with theta as 
"master," capable of tracking the input pseudo-rhythm. During 
a successful tracking, the theta cycles are aligned with intervocalic
speech fragments termed θ-syllables 2 . Oscillation-based
models hypothesize that intelligibility is correlated with the abil- 
ity of the theta oscillator to remain in sync with the input 
stream (e.g., Ghitza, 2012; Doelling et al., 2014). Intelligibility 
remains high as long as theta is in sync with the input (this 
is the case for moderate speech speeds) and sharply deterio- 
rates once theta is out of sync (when the input syllabic rate 
is beyond the theta frequency range). Since the knee-point of 
intelligibility restoration defines the maximum reliable informa- 
tion transfer rate through the auditory channel (i.e., auditory 
channel capacity), one may conclude that the tracking capa- 
bility of theta determines channel capacity. Can this conclu- 
sion account for the improvement in intelligibility gained by 
repackaging? 

In interpreting the left-hand-side of their U-shaped behavioral 
data (i.e., increased intelligibility restoration with the increase of 
gap duration) Ghitza and Greenberg suggested that the insertion 
of gaps is an act of providing extra decoding time, and that the 
gradual change in gap duration should be viewed as tuning the 
packaging rate in a search for a better synchronization between 
the input information flow and the capacity of the auditory chan- 
nel; repackaging with a gap duration (i.e., decoding time) that is 
too short results in errors due to a mismatch between the amount 
of information in the input stream (in terms of the number 
of diphones per unit time) and the capacity of the auditory 
channel (in terms of the number of reliable diphone-neuron 
activations per unit time). Consequently, they hypothesized that
the optimal range of packaging rate is dictated by the properties
of the cortical theta, and that the best synchronization is
achieved by tuning the packaging rate toward the mid range of
theta (Ghitza, 2011). Ghitza and Greenberg measured intelligibility
as a function of gap duration (read: packaging rate) at only
one time-compression condition (compression factor of three)
and one packet duration condition (duration of 40 ms), with the
operating points below capacity. In the study described here, we
measured the knee-point of intelligibility restoration as a function
of repackaging (with packet duration and packaging rate
as parameters) for fast speech with compression factors of up
to eight. The combination of packaging rate and packet duration
at knee-point defines the maximum rate at which speech
information can be reliably transmitted through the auditory
channel, i.e., the auditory channel capacity. As we shall see,
irrespective of speech speed, the packaging rate and packet duration
at capacity are aligned with properties of cortical theta, suggesting
that the auditory channel capacity for speech is determined by
theta.

2 The θ-syllable (Ghitza, 2013) is re-introduced in section "Definitions."

The remainder of the paper is organized as follows. The psy- 
chophysical procedure to measure auditory channel capacity is 
described in section "Psychophysical measurement of auditory 
channel capacity." Section "Materials and Methods" describes the
speech corpus, the psychophysical paradigm, and the data analy- 
sis procedure; it also introduces definitions which will assist us in 
characterizing the relationship between the rate by which speech 
information is delivered to the listener, on the one hand, and 
intelligibility (i.e., a measure of the accuracy of speech percep- 
tion), on the other. Three experiments are reported, in which 
intelligibility (in terms of word accuracy) is measured as a func- 
tion of compression factor, packaging rate and packet duration. 
The stimulus preparation and the collected data, per experiment, 
are described in section "Results." In section "Discussion" the data 
is interpreted through the prism of oscillation-based models, and
the possible generalizability of the results to other corpora (e.g., 
languages other than English) is discussed. 

2. PSYCHOPHYSICAL MEASUREMENT OF AUDITORY 
CHANNEL CAPACITY 

Figure 1A shows a generic communication system for the transmission
of a message that belongs to a set W through a noisy
channel. The system is composed of an encoder X^n, the noisy
channel, and a decoder g. The encoder maps messages W onto
(binary) input sequences of length n, X, to the channel. The
decoder maps the output sequences y onto received messages
Ŵ. We seek encoders that produce non-confusable, widely
spaced input sequences to the channel. The highest rate, in bits
per channel use, at which information can be sent with arbitrarily
low probability of error is called channel capacity. The
encoders at capacity, X^n*, satisfy Pr{error} → 0, or equivalently,
d_hamm(x_i, y_i) → 0 (measured at the decoder), where d_hamm is the
Hamming distance 3 , and x_i, y_i are the input and output sequences,
respectively.
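
As a small illustration of the distance measure referred to above, the following Python sketch computes the Hamming distance between two equal-length sequences. The function name `hamming_distance` and the example digit strings are illustrative assumptions, not part of the original study.

```python
def hamming_distance(sent, received):
    """Count the positions at which two equal-length sequences differ."""
    if len(sent) != len(received):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(a != b for a, b in zip(sent, received))


# Hypothetical example: a 4-digit target and a response with one wrong digit.
print(hamming_distance("7018", "7918"))  # prints 1
```

A distance of zero corresponds to an error-free transmission of the sequence.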



3 The Hamming distance between two strings of equal length is the number of 
positions at which the corresponding symbols are different. 
To measure auditory channel capacity we translated the classic
derivation (e.g., Shannon, 1948) into a psychophysical procedure.
The auditory analog to the communication system in
Figure 1A is shown in Figure 1B. The auditory channel is defined
as follows: 

Definition: The auditory channel includes all pre-lexical layers, 
with acoustic waveforms as input and syllable objects as output. 

FIGURE 1 | (A) A block diagram of a generic communication system. The
encoder maps the source onto (binary) non-confusable, widely spaced input
sequences to the noisy channel, so that a message can be transmitted with
a desirably low probability of error. The maximum rate at which this can be
done is called the capacity of the channel. (B) A block diagram of the
auditory analog to the communication system in (A). The encoder maps
words onto acoustic waveforms and is defined by the time-compression
factor, κ, and the parameters of the repackaging process, i.e., the
packaging rate φ and the packet duration δ (see Figure 2). The channel is
the auditory channel and the decoder is the cortical receiver, both defined
in section "Psychophysical measurement of auditory channel capacity."



Corollary: The first layer of the cortical receiver is the lexical-access 
circuitry (i.e., words as output). 

Such a partitioning of the auditory system stems from the pos- 
tulation that, when engaging in a spoken dialog, the small- 
est linguistic meaningful units are words (e.g., Cutler, 1994, 
2012). 

In the psychophysical realization, the encoding scheme is realized
by a uniform time-compression operator, defined by the
compression factor κ, followed by repackaging. Repackaging is
defined by two parameters, the packaging rate φ and the packet
duration δ (see Figure 2). The encoder is denoted X_κ^{φ×δ}: the
subscript κ is the compression factor, and the superscript φ × δ
defines the parameter space in the search for maximum intelligibility.
The parameter values at optimum, κ, φ* and δ*, define
the encoder at capacity X*, the most favorable for the auditory
channel; φ* and δ* define the maximum information transfer
rate, hence enabling a quantitative estimate of auditory capacity.
Since intelligibility is measured in terms of word accuracy,
the search for optimal intelligibility restoration can be viewed
as an act of minimizing D, D = d_hamm(w_i, ŵ_i), where w_i, ŵ_i are
the spoken and perceived words, respectively. D is defined at the
receiver, in compliance with our way of partitioning the auditory
system where the first layer of the cortical receiver is assumed
to spell words as output. We assume that the cortical receiver is
error free: as described in section "Materials and Methods," the
behavioral task is a digit-string recognition, with a memory load
of 4 digits. Such memory load is less than the immediate memory
span, and the duration of 4 digits is less than the memory
decay time (=2 s, e.g., Cowan, 1984). Note that the assumption
of an error free cortical receiver implies that errors are the result
of erroneous representation of pre-lexical units, transmitted at
a rate beyond capacity (i.e., errors are induced by the auditory
channel).

FIGURE 2 | Illustration of repackaging of a time-compressed waveform.
The upper panel shows the waveform and the spectrogram of a sentence
time-compressed by a factor of κ = 3. The compressed waveform is blindly
segmented into packets with equal duration of δ (red boxes, upper panel).
The lower panel shows the time-compressed waveform after repackaging,
with a packaging rate of φ = 1/Δ. The acoustic signal inside the δ-long
packet is the time-compressed signal. A low-level background, speech-shaped
noise is added (with SNR = 20 dB). Legend: δ, packet duration; Δ, packet
presentation duration; φ = 1/Δ, packaging rate.
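
The repackaging operation illustrated in Figure 2 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure used to generate the stimuli: the function name `repackage`, the placement of each packet at the start of its presentation window, and the use of silent (zero) gaps are all assumptions, and the low-level speech-shaped background noise is omitted.

```python
def repackage(samples, sample_rate_hz, packet_ms, rate_hz):
    """Split a time-compressed waveform into packets and insert gaps.

    samples        : list of waveform samples (already time-compressed)
    sample_rate_hz : sampling rate of the waveform
    packet_ms      : packet duration (delta), in ms
    rate_hz        : packaging rate (phi), in packets per second
    """
    packet_len = round(sample_rate_hz * packet_ms / 1000)  # samples per packet
    window_len = round(sample_rate_hz / rate_hz)           # samples per window (Delta = 1/phi)
    if packet_len < 1:
        raise ValueError("packet duration is shorter than one sample")
    if packet_len > window_len:
        raise ValueError("packet does not fit in the presentation window")

    out = []
    for start in range(0, len(samples), packet_len):
        packet = samples[start:start + packet_len]
        out.extend(packet)
        # Assumption: the packet sits at the start of its window, followed by silence.
        out.extend([0.0] * (window_len - len(packet)))
    return out


# Toy example: 2-sample packets delivered in 4-sample windows.
print(repackage([1, 2, 3, 4, 5, 6], 1000, 2, 250))
# prints [1, 2, 0.0, 0.0, 3, 4, 0.0, 0.0, 5, 6, 0.0, 0.0]
```

Each output window has the same length, so the packaging rate is fixed by `rate_hz` regardless of the packet duration.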

3. MATERIALS AND METHODS 

3.1. SUBJECTS 

All listeners, eight in number, were young adults (four female and 
four male college students, between 20 and 25 years of age) edu- 
cated in the U.S.A. (English as first language) with normal hearing 
(screened for normal threshold audiograms). Their responses 
were reasonably consistent with each other, hence no further 
recruitment was needed. 

3.2. CORPUS 

The experimental corpus comprised 100 digit strings spoken 
fluently by a male speaker. Each string is a 7-digit sequence, 
approximately 2 s long. It is uttered as a phone number in an 
American accent, i.e., a cluster of 3 digits followed by a cluster of 
4 digits (for example: "two six two, seven zero one eight"). It is a 
low perplexity corpus (a vocabulary of 11 words, 0 to 9 and O) but
semantically unpredictable. Each waveform file is accompanied by 
a phonetic transcription file, which includes the time instances of 
all acoustic landmarks including, in particular, vocalic nuclei (i.e., 
mid vowel markers 4 ). These were marked by experienced phoneti- 
cians (by hand). For each signal condition, 80 stimuli (out of 100) 
were chosen at random and concatenated in a sequence: [alert 
tone] [digit string] [5-s long silence gap] [alert tone] . . . 

3.3. EXPERIMENTAL PARADIGM 

Subjects performed the experiment in an isolated office environ- 
ment (no other occupants) using headphones. The sound pres- 
sure was adjusted by the subject to a comfort level and remained 
unchanged throughout the experiment. Stimuli were presented 
diotically. Each subject was tested on 50 signal conditions over- 
all in 10 2-h sessions (5 conditions per session). Each condition 
was presented once, and the order of presentation was the same 
for all subjects. A condition comprised two phases, Training and 
Testing. The training set and the testing set contained 10 and 80
digit strings, respectively; each condition took approximately 10 min to complete.
Training preceded testing; in the training phase, subjects had to 
perform above a prescribed threshold before proceeding to the 
testing phase. Subjects were instructed to listen to a digit string 
once and, during the 5-s long gap following the stimulus, to type 
into an electronic file the last 4 digits heard, in the order presented 
(always 4 digits, even those that she/he was uncertain about). The 
rationale behind choosing the last 4 digits as target (as opposed
to choosing the entire 7-digit string) was twofold. First, it was
an attempt to provide the opportunity for the presumed (corti- 
cal) theta oscillator to entrain to the input rhythm prior to the 
occurrence of the target words (recall the inherent rhythm in the 
stimuli, being a 7-digit phone number uttered in an American 
accent). Second, it aimed at reducing the bias of memory load on 
the error patterns. 

The human-subjects protocol for this study was approved by
the Institutional Review Board of Boston University. Each participant
provided his/her written informed consent to participate in this
study. This consent procedure was approved by the Institutional
Review Board of Boston University.

4 Note that the definition of a mid-vowel location is loose, within a time
interval on the order of a few pitch cycles.

3.4. DATA ANALYSIS 

The digit-string comprehension accuracy was measured as follows.
Per stimulus, digit-string comprehension was defined as
string correct, C_i, with C_i = 1 when the last 4 digits, as a whole,
are correctly understood, and 0 otherwise. Per experiment, the
data comprises 8 subjects, each of whom was tested under N conditions,
ψ ∈ [1, 2, ..., N], with 80 sentences heard under each
condition (for example, in Experiment I, ψ is the compression
factor κ, κ ∈ {2, 3, 4, 5}, i.e., N = 4). A hierarchical logistic
regression was used to model the data, capturing the effect of each
subject and each condition ψ on digit string comprehension. This
approach is conceptually similar to a classical ANOVA comparison
(Gelman, 2005): (a) inferences for all means and variances are
performed under a model with a separate batch of effects for each
row of the ANOVA table; (b) the model automatically gives the
correct comparisons even in complex scenarios; and (c) this is a
preferred approach when dealing with small sample size, as is the
case here with only 8 subjects.
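
The per-stimulus "string correct" score described above can be illustrated with a short Python sketch. The function names and the example trials are hypothetical, and the raw percentage computed here is only a descriptive summary; it is not the hierarchical-model estimate reported in the paper.

```python
def string_correct(spoken, typed, n_target=4):
    """Return 1 if the typed response matches the last n_target spoken digits as a whole, else 0."""
    return int(spoken[-n_target:] == typed)


def percent_correct(trials):
    """Percentage of trials scored as string correct; trials is a list of (spoken, typed) pairs."""
    scores = [string_correct(spoken, typed) for spoken, typed in trials]
    return 100.0 * sum(scores) / len(scores)


# Hypothetical trials: one fully correct, one with a single wrong digit.
trials = [("2627018", "7018"), ("4159263", "9268")]
print(percent_correct(trials))  # prints 50.0
```

Because the score is all-or-nothing, a single wrong digit among the last four counts as an error for the whole string.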

The model provides estimates for the average accuracy at
each level of ψ. Instead of simply reporting standard errors for
significance testing, this approach allows the flexibility of fully
propagating the uncertainty inherent in all pieces of the model
(Gelman and Hill, 2007). Here, this was done through a simulation
framework, where the model's estimates were simulated 1000
times. We computed 95% credible intervals around the accuracy
levels at each ψ; these are the Bayesian equivalent of confidence
intervals, again accounting for the full uncertainty in the model 5 .
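
One common way to obtain such an interval from simulated estimates is to take equal-tailed percentiles of the draws. The sketch below assumes that approach; the paper does not spell out the exact computation, so the function `credible_interval`, the interpolation rule, and the synthetic draws are all assumptions made for illustration.

```python
import random


def credible_interval(draws, level=0.95):
    """Equal-tailed interval from simulated draws (linear interpolation between order statistics)."""
    ordered = sorted(draws)

    def quantile(q):
        pos = q * (len(ordered) - 1)
        lower = int(pos)
        upper = min(lower + 1, len(ordered) - 1)
        frac = pos - lower
        return ordered[lower] * (1 - frac) + ordered[upper] * frac

    tail = (1 - level) / 2
    return quantile(tail), quantile(1 - tail)


# Synthetic stand-in for 1000 simulated accuracy estimates (not data from the study).
random.seed(0)
draws = [random.gauss(0.9, 0.02) for _ in range(1000)]
low, high = credible_interval(draws)
print(low, high)
```

An interval computed this way need not be symmetrical around the mean of the draws.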

The results plotted are estimates of percent correct, shown for
each ψ, with error bars indicating the 95% credible intervals.
Visually, we emphasize the credible interval around the estimated
accuracy of ψ*, the reference condition. The estimated
accuracies of the surrounding conditions are compared to the estimated
accuracy of the reference condition, and the error bars
indicate whether the differences are statistically significant when
considering the credible intervals.

3.5. DEFINITIONS 

Three quantities are defined, which will assist us in characterizing 
the relationship between the rate by which speech information 
is delivered to the listener, on the one hand, and intelligibility 
(i.e., a measure of the accuracy of speech perception), on the 
other. The first quantity is the Articulated Speech Information 
(ASI), a measure of the amount of information carried by a frag- 
ment of time-compressed speech. The second quantity is the 
ASI-Rate — the rate by which the ASI is delivered. These mea- 
sures characterize stimulus properties and have nothing to do 
with perception. The third quantity is the θ-syllable, an acoustic
correlate of a unit of speech information defined by cortical
function. 



5 Because these simulations are not simply standard error calculations, the 
credible intervals are not restricted to be symmetrical around the mean, as 
can be seen under close inspection of the data later on. 

3.5.1. Articulated Speech Information (ASI and ASIr)

Since listeners are presented with time-compressed versions of the
original waveform, a question arises: how to quantify the amount
of information carried by a fragment of time-compressed speech?
For example, what is the amount of information within a 40-ms
long interval of speech, time-compressed by a factor of 4? We propose
to measure this quantity in terms of the information that was
intended to be conveyed by the speaker when uttered (i.e., before
compression).

Definition: the Articulated Speech Information (ASI), denoted π,
carried by a δ-long fragment of a κ-compressed stimulus is the
amount of information, in bits, in the corresponding uncompressed
fragment.

Note that the speech fragment in question is arbitrary, i.e., it
doesn't have to be aligned with any particular linguistic unit.

In our study a speech corpus with low perplexity is used
(7-digit strings). In this case, it is reasonable to assume that the
ASI carried by a speech fragment that is a few tens of milliseconds
long is related to the duration of the uncompressed fragment, i.e.,
π ~ δ·κ (see Figure 3).

Definition: ASIr, denoted π_τ, is an estimate, in time units, of
the ASI carried by a δ-long fragment of a κ-compressed stimulus,
and equals δ·κ. To distinguish duration (of a time interval) from
ASIr, both measured in time units, we denote 1 ms of ASIr as
1 ms_π.

That is, for the 7-digit strings corpus we assume {ASI, in bits} ~
{ASIr, in ms_π}. In our example, the ASI (π, in bits) carried by a
40-ms long fragment of speech time-compressed by 4 is related to
an ASIr that equals π_τ = 40 · 4 = 160 ms_π.

It is worth emphasizing that there is a distinction between
ASI, the amount of information articulated by the speaker (i.e.,
intended to be conveyed), and the amount of information perceived
by the listener. During the decoding process some of the
articulated information may be lost; the amount of the loss
depends on κ and is measured with respect to the ASI.

3.5.2. ASI-Rate and ASIr-Rate

Let ASI-Rate (or, equivalently, ASIr-Rate) be the information
rate in transmitting π bits of ASI (or, equivalently, π_τ ms_π of
ASIr) by a δ-long fragment of κ-compressed speech, and let
both be denoted R_κ^δ. Then:

\[ R_\kappa^\delta = \frac{\pi}{\delta}\ \text{[bits/s]} \;\sim\; \frac{\pi_\tau}{\delta}\ \text{[ms}_\pi\text{/s]} \]

In the remainder of the paper we shall omit, for simplicity, the
subscript and superscript of R_κ^δ, using R instead, measured in
ms_π/s.
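
The ASIr arithmetic in the 40-ms example can be written out as a short Python sketch. The function names are hypothetical. The rate is taken to be the ASIr divided by the fragment duration, which is an assumption here, although it agrees with the value of 3 ms of ASIr per 1 ms of waveform reported for a compression factor of three in section "Results."

```python
def asi_tau(delta_ms, kappa):
    """ASIr (in ms_pi) carried by a delta_ms-long fragment of kappa-compressed speech."""
    return delta_ms * kappa


def asi_tau_rate(delta_ms, kappa):
    """ASIr-Rate, in ms_pi per ms of time-compressed waveform, for a fragment with no gaps.

    Assumption: rate = ASIr / fragment duration.
    """
    return asi_tau(delta_ms, kappa) / delta_ms


print(asi_tau(40, 4))       # prints 160
print(asi_tau_rate(40, 4))  # prints 4.0
```

Without gaps, the rate in these units reduces to the compression factor itself, independent of the fragment duration.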

FIGURE 3 | What is the amount of speech information carried by a
fragment of time-compressed speech? We define Articulated Speech
Information (ASI) carried by a δ-long segment of a κ-compressed stimulus
(red box in lower panel) as the amount of information, in bits, in the
corresponding uncompressed segment (red box, upper panel). ASI is the
speech information that was intended to be conveyed by the speaker when
uttered (i.e., before compression). See text (section "Definitions") for the
definition of ASIr. Legend (shown for κ = 3): κ, compression ratio; δ, duration
of time-compressed segment; π, Articulated Speech Information (ASI), in bits;
π_τ = δ·κ, quantitative measure of ASI, denoted ASIr, in ms_π.

3.5.3. The θ-syllable

A widely accepted assessment is that a consistent acoustic correlate
to the (conventional) syllable is hard to define (e.g.,
Cummins, 2012). Concurring with this assessment, and in light of
the proposed role of the theta oscillator in governing the decoding
process (e.g., Ghitza, 2011; Giraud and Poeppel, 2012), Ghitza
(2013) suggested the θ-syllable as an alternative unit, inspired by
brain function:

Definition: A θ-syllable is a θ-cycle long speech segment located
in between two successive vocalic nuclei.

During a successful tracking by the theta oscillator (for uncompressed
speech, in quiet, this is the normative case) one θ-cycle
is aligned with the interval between two successive vocalic nuclei.
As such, the θ-syllable is a non-ambiguous acoustic correlate to
a VΣV (the Σ stands for consonant cluster). Given the prominence
of vocalic nuclei in the presence of environmental noise,
the θ-syllable is robustly defined. The θ-syllable is also invariant
to time scale modifications that result in intelligible speech. When
listening to time-compressed speech that is intelligible, the cortical
theta is in sync with the stimulus. Thus, the speech fragment
that corresponds to a theta cycle is the time-compressed version of
the corresponding uncompressed VΣV fragment (Ghitza, 2013).

4. RESULTS 
4.1. OVERVIEW 

Three experiments were conducted. In Experiment I, listeners
were presented with time-compressed speech without repackaging,
with the time-compression factor, κ, as the parameter. Speech
information is delivered in a "natural way," i.e., the "packaging
rate" is the syllabic rate of the stimulus and a packet is the
time-compressed θ-syllable. The goal is to find κ*, the κ at knee-point
of performance. The θ-syllable rate at knee-point is denoted φ*,
and the average "packet presentation" duration is the duration
of a φ* cycle, Δ* = 1/φ*. In Experiment II, κ is increased beyond
κ*, resulting in a deterioration in performance. Intelligibility
is recovered by launching the repackaging process depicted in
Figure 2, with a parameter search in the φ × δ space (i.e., the
[packaging-rate] × [packet-duration] space). The parameter values
at optimum, φ° and δ°, define the information rate at the
optimal recovery point, denoted R° 6 . This process is repeated for
every value of κ, κ > κ*; as we shall see, R° is independent of κ.
In Experiment III, we verify that R° is indeed an estimate of the
auditory channel capacity.



Let k at knee-point be denoted k*. We define: 




Ty-^y is the duration of an intervocalic segment at k* (equals 
the difference between two successive vocalic nuclei marked as 
described in subsection "Corpus"), (p* is the average natural pack- 
aging rate of the k* -compressed waveform, A* is the average 
packet presentation duration, and n * and _R* are the average ASIr 
and the average ASIr-Rate at knee-point, respectively. The drop 
in performance for k > k* is interpreted to be the result of the 
cortical 6 reaching the upper limit of its frequency range, # max 
(Ghitza, 20 1 1 ) . A corollary to this interpretation is that </>* reflects 
#max- Note that, biophysically, # max is not a cutoff frequency in a 
"brick-wall" sense; rather, 6 diminishes in a gradual manner. In 
the reminder of the paper we shall assume a brick-wall f? max . 

4.2.2. Data 

The results are shown in Figure 4. Estimates of word recognition 
accuracy (in percent correct) are shown for each a: e {2, 3, 4, 5}, 
with error bars indicating the 95% credible intervals. To deter- 
mine the knee-point of performance we compare the estimated 
accuracy at a prescribed candidate condition with the accuracy 
at the preceding and following conditions. Shown is a candi- 
date condition k = 3, with the credible interval around it visually 
highlighted (gray horizontal strip). The estimated accuracy at 
k = 3 is 96% — quite close to 99% (average accuracy when k = 2) 
and considerably better than 91% (when k = 4). The error bars 



4.2. EXPERIMENT I: INCREASE k TO KNEE-POINT OF PERFORMANCE 

4.2. 1. Stimulus preparation 

The compression factor, k, was gradually increased to a knee- 
point of performance, measured in terms of word recognition 
accuracy. The waveforms were time-compressed using a pitch- 
synchronous, overlap and add (PSOLA) procedure (Moulines and 
Charpentier, 1990) incorporated into PRAAT, a speech analysis 
and modification package (http://www.fon.hum.uva.nl/praat/). 
The formant patterns and other spectral properties of the time- 
compressed signal are preserved but altered in duration (compare 
upper and lower panels in Figure 3), however, the fundamen- 
tal frequency ("pitch") contour remains the same 7 . Note that, by 
definition, the ASIr within a a: -compressed 9 -syllable (i.e., an 
intervocalic segment, k -compressed) is same for all k, equals to 
7i x msjr. 



6 Note that we use different superscript symbols to indicate optimum, * 
for the compression without repackaging, and 0 for the compression with 
repackaging. 

7 Preserving the pitch contour is the main motivation for using the PSOLA 
methods. 
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FIGURE 4 | Time compression without gaps. Shown are estimates of 


word recognition accuracy (in percent correct) for each k e {2, 3 
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with error bars indicating the 95% credible intervals. The knee-point is at 


k* = 3, with the credible interval around it visually highlighted (gray 


horizontal strip). 







Frontiers in Psychology | Auditory Cognitive Neuroscience 



July 2014 | Volume 5 | Article 652 | 6 



Ghitza 



Cortical 6 oscillations determine channel capacity 



K = 5 




380 430 

ASI T (ITIS*) 



FIGURE 5 | Time compression with k > 3. Such degree of time 
compression results in a massive deterioration in performance. To recover 
performance repackaging was applied, with a packaging rate of tp* = 9 Hz. 
Five panels are shown, one for each *:. For each panel, estimates of 
accuracy (in percent correct) are shown for each n x e {230, 280, 330, 
380, 430) ms„ with error bars indicating the 95% credible intervals. The 
knee-point of recovery is at 330 ms,, with the credible interval around it 
visually highlighted (gray horizontal strip). ASIr at knee-point is a constant, 
independent of k, equals the average duration of one uncompressed 
6-syllable and delivered in k -compressed ^-syllable long packets. Since the 
packaging rate <j>* = 9 Hz (interpreted to be equal to cortical 6 max ), the 
information transfer rate at knee-point of recovery is 9 6-syllables/s. 



indicate that, in both cases, the differences are statistically significant when considering the credible intervals. Consequently, the knee-point is determined to be κ* = 3.

Using Equations (1)-(4) we obtain that at κ* = 3: φ* = 9 Hz, Δ* = 110 ms, π_τ* = 330 ms_π, and R* = π_τ*/Δ* = 330/110 = 3 ms_π/ms. In words, at the knee-point, the average packaging rate is 9 θ-syllables/s, a packet is a κ*-compressed θ-syllable with an average duration of 110 ms, the ASI_τ carried by a packet is the duration of an uncompressed θ-syllable with an average duration of 330 ms_π, and the information transfer rate is 3 ms of ASI_τ (measured in ms_π) per 1 ms of time-compressed waveform.
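The knee-point arithmetic can be checked with a minimal sketch. It assumes the relations implied by how the quantities are used in the text (packet presentation duration Δ = 1/φ, and ASI_τ-Rate equal to the uncompressed duration divided by the packet duration). The function name and arguments are illustrative, and the nominal values (about 3 syllables/s and about 330 ms per θ-syllable) are the averages quoted for this corpus.

```python
def knee_point_quantities(kappa, uncompressed_ms, nominal_rate_hz):
    """Illustrative helper (not from the paper): quantities for gap-free compression."""
    phi_hz = kappa * nominal_rate_hz        # syllabic (packaging) rate after compression
    packet_ms = uncompressed_ms / kappa     # duration of one compressed theta-syllable
    cycle_ms = 1000.0 / phi_hz              # packet presentation duration, Delta = 1/phi
    rate = uncompressed_ms / packet_ms      # ASI_tau-Rate in ms_pi per ms of waveform
    return phi_hz, packet_ms, cycle_ms, rate


# Assumed nominal values: 330 ms theta-syllable, 3 syllables/s, kappa* = 3.
phi_hz, packet_ms, cycle_ms, rate = knee_point_quantities(3, 330.0, 3.0)
print(phi_hz, packet_ms, round(cycle_ms, 1), rate)
```

Note that 1/φ* for exactly 9 Hz is about 111 ms; the text rounds this to 110 ms so that it matches the average packet duration.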

4.3. EXPERIMENT II: INCREASE κ BEYOND KNEE-POINT

4.3.1. Stimulus preparation

The compression factor, κ, was increased beyond κ*, resulting in a massive deterioration in performance (see, for example, performance at κ = 5, shown in Figure 4). To recover performance, repackaging was applied. In accordance with the interpretation that φ* reflects θ_max (subsection "Experiment I"), the packaging rate was frozen at φ* for all values of κ, κ > 3, leaving the packet duration, δ, as the only varying parameter in the search for optimal recovery. Packet duration at the knee-point of optimal recovery is denoted δ°, and the ASI_τ carried by this packet is:

\[ \pi_\tau^{\circ} = \delta^{\circ} \cdot \kappa \tag{5} \]

hence the ASI_τ-Rate:

\[ R^{\circ} = \frac{\pi_\tau^{\circ}}{\Delta^{*}} \tag{6} \]

We seek R° (the ASI_τ-Rate at optimal recovery) as a function of κ. Since Δ* is the same for all κ (because φ* is frozen), seeking R° is equivalent to seeking π_τ° [the ASI_τ at optimal recovery, see Equation (6)].

4.3.2. Data 

R° was measured for κ ∈ {4, 5, 6, 7, 8}. For each κ, the packaging rate was frozen at φ* = 9 Hz (with Δ* = 110 ms), and the packet duration δ was the search parameter. Five values of δ were used, defined by five prescribed values of ASI_τ: π_τ = [230 280 330 380 430] ms_π. (Note that the mid-value of the five-value π_τ is 330 ms_π, the ASI_τ at the knee-point κ* = 3; see Experiment I.) The same five-value π_τ was used for all κ. For a given κ, δ was derived from π_τ as:

\[ \delta = \frac{\pi_\tau}{\kappa}\ \text{ms} \tag{7} \]

For example, for κ = 5 Equation (7) yields δ = [46 56 66 76 86] ms. With the packaging rate frozen at φ* = 9 Hz, the five-value δ defines five repackaging conditions per κ.
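As a quick check of Equation (7), the sketch below lists the packet durations for every compression factor tested. The names `PI_TAU_MS`, `KAPPAS`, and `packet_durations` are illustrative and do not come from the paper.

```python
# Prescribed ASI_tau values (ms_pi) and compression factors, as listed in the text.
PI_TAU_MS = [230, 280, 330, 380, 430]
KAPPAS = [4, 5, 6, 7, 8]


def packet_durations(kappa, pi_tau_ms=PI_TAU_MS):
    """Equation (7): packet duration in ms is the ASI_tau divided by kappa."""
    return [p / kappa for p in pi_tau_ms]


# One list of five packet durations per compression factor.
conditions = {kappa: packet_durations(kappa) for kappa in KAPPAS}
for kappa, durations in conditions.items():
    print(kappa, [round(d, 1) for d in durations])
```

For κ = 5 this reproduces the five durations quoted above; the middle entry of each list is the duration of a κ-compressed 330 ms_π fragment.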

The results, shown in Figure 5, are organized in five panels, one for each κ ∈ {4, 5, 6, 7, 8}. For each panel, estimates of accuracy (in percent correct) are shown for each π_τ ∈ {230, 280, 330, 380, 430} ms_π, with error bars indicating the 95% credible intervals. To determine the knee-point of performance we compare the estimated accuracy at a prescribed candidate condition with the accuracy at the preceding and following conditions. Shown is a candidate condition π_τ = 330 ms_π, with the credible interval around it visually highlighted (gray horizontal strip). The estimated accuracy at π_τ = 330 ms_π is quite close to the accuracy at π_τ = 280 ms_π, and considerably better than the accuracy at π_τ = 380 ms_π (this is especially so for κ = 6, 7, and 8). The error bars indicate that the differences in estimated accuracies are statistically significant when considering the credible intervals. Consequently, the knee-point is determined to be at π_τ° = 330 ms_π. Relating this finding to the finding of Experiment I reveals:

\[ \pi_\tau^{\circ} = \pi_\tau^{*} = 330\ \text{ms}_\pi, \quad \forall\kappa \tag{8} \]
FIGURE 6 | Are we at capacity? Performance for combinations of packaging-rate × packet-duration with information rates R greater than R°, the rate at optimal recovery. (Left) Estimated accuracy as a function of packaging rate φ > φ* = 9 Hz. For all φ, packet duration is such that ASI_τ is a constant (equals 330 ms_π). The reference condition is at R° (i.e., φ* = 9 Hz and π_τ° = 330 ms_π), with the credible interval around it visually highlighted (gray horizontal strip). (Right) Word accuracy as a function of ASI_τ > π_τ° = 330 ms_π. Packaging rate was reduced to φ = 5 Hz in order to maintain a packet duration δ that is smaller than the packet presentation duration Δ. The reference condition is at R°, denoted 330* in the figure (the star indicates φ* = 9 Hz, as opposed to φ = 5 Hz in all other ASI_τ values), with the credible interval around it visually highlighted (gray horizontal strip). In both left and right columns, R° gives the best performance ⇒ R° is the auditory channel capacity, C_auditory = 9 θ-syllables/s.





That is, the ASI_τ at the knee-point of recovery is a constant, independent of κ: it equals the average duration of one uncompressed θ-syllable and is delivered in κ-compressed θ-syllable long packets. Since the packaging rate is φ* = 9 Hz (interpreted to be equal to cortical θ_max), the information transfer rate at the knee-point of recovery is 9 θ-syllables/s. Or, expressed in ASI_τ-Rate:

\[ R^{\circ} = \frac{\pi_\tau^{\circ}}{\Delta^{*}} = \frac{\pi_\tau^{*}}{\Delta^{*}} = R^{*} = 3\ \text{ms}_\pi/\text{ms}, \quad \forall\kappa \tag{9} \]

That is, the ASI_τ-Rate is a constant, equal to R* = 3 ms_π/ms, for all κ.

4.4. EXPERIMENT III: ARE WE AT CAPACITY? 

4.4.1. Stimulus preparation

In Experiment II we found that the ASI_τ-Rate at optimal recovery is R° = R* = 3 ms_π/ms, for all κ. The φ* and δ° combination that determined R° was φ* = 9 Hz and δ°, the duration of a κ-compressed speech fragment with ASI_τ π_τ° = 330 ms_π. For R° to be considered capacity we must show that there exists no R > R° that maintains performance. In the experiment described here we measured performance for R's with R > R°, and found that performance deteriorated for all R's tested, thus concluding that R° is indeed an estimate of auditory capacity.

4.4.2. Data 

Recalling that R° = π_τ°/Δ* = π_τ° · φ*, we obtained R > R° by using φ > φ* while keeping π_τ = π_τ°. In particular, we used π_τ = 330 ms_π and φ = [12 15 18 21] Hz, so that R = π_τ · φ = [4 5 6 7] ms_π/ms (each entry greater than R° = 3 ms_π/ms). The results, shown in the left-hand-side column of Figure 6, are organized in three panels, one for each κ ∈ {6, 7, 8}. For each panel, estimates of accuracy (in percent correct) are shown for each φ ∈ {9, 12, 15, 18, 21} Hz, with error bars indicating the 95% credible intervals. The reference condition is at R° (i.e., φ* = 9 Hz and π_τ° = 330 ms_π), with the credible interval around it visually highlighted (gray horizontal strip).

We also measured performance for π_τ > π_τ°, i.e., a packet duration δ = π_τ/κ > δ° = π_τ°/κ, the duration at optimal recovery. We chose δ's defined by π_τ = [380 430 480] ms_π. In order to maintain a packet duration δ that is smaller than the packet presentation duration Δ, the packaging rate was reduced to φ = 5 Hz. Note that for such a choice of φ, R = [1.9 2.15 2.4] ms_π/ms (each entry smaller than R° = 3 ms_π/ms). The results, shown in the right-hand-side column of Figure 6, are organized in three panels, one for each κ ∈ {6, 7, 8}. For each panel, estimates of accuracy (in percent correct) are shown for each π_τ ∈ {330*, 380, 430, 480} ms_π, with error bars indicating the 95% credible intervals. The reference condition is at R°, denoted 330* in the figure (the star indicates φ* = 9 Hz, as opposed to φ = 5 Hz in all other π_τ values), with the credible interval around it visually highlighted (gray horizontal strip).
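The rates quoted for the two tests follow from R = π_τ · φ. The sketch below recomputes them; the function name is illustrative, and the only assumption is the unit conversion from ms_π per second to ms_π per ms.

```python
def asi_tau_rate(pi_tau_ms, phi_hz):
    """ASI_tau-Rate in ms_pi per ms: pi_tau delivered once per 1/phi seconds."""
    return pi_tau_ms * phi_hz / 1000.0


# Left column of Figure 6: fixed pi_tau = 330 ms_pi, packaging rate above 9 Hz.
left = [asi_tau_rate(330, phi) for phi in (12, 15, 18, 21)]

# Right column of Figure 6: pi_tau above 330 ms_pi, packaging rate lowered to 5 Hz.
right = [asi_tau_rate(p, 5) for p in (380, 430, 480)]

# Reference condition: pi_tau = 330 ms_pi at 9 Hz.
reference = asi_tau_rate(330, 9)

print(left, right, reference)
```

With an exact 9 Hz rate the reference comes out slightly below 3 ms_π/ms, and the left-column values slightly below the integers quoted; the text's round figures correspond to taking Δ* as 110 ms.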

In both tests R° gives the best performance, leading to the conclusion that R°, indeed, is the auditory channel capacity, denoted C_auditory.



5. DISCUSSION 

Conceptually, information transfer rate can be expressed in units of bits/s (ASI-Rate), ms_π/s (ASI_τ-Rate), or θ-syllables/s. As we shall see in subsection "How generalizable are these findings?," θ-syllables/s is the most insightful unit.

In Experiment I we found that for time compression without repackaging, the knee-point of performance is at κ* = 3. The "natural packaging" rate (i.e., the syllabic rate) is φ* = 9 natural-packets/s, in correspondence with θ_max, the upper limit of cortical theta (≈9 Hz), and one natural-packet contains one
FIGURE 7 | Packaging rate, φ*, and packet duration, δ°, at capacity. For uncompressed speech (i.e., κ = 1, not shown), speech information is delivered naturally: the packaging rate is the nominal syllabic rate (≈ 3 syllables/s, for our speech corpora) and a packet is a θ-syllable with an average duration of ≈ 330 ms. (A) Knee-point of performance for uniform time-compression without gaps, κ* = 3. Speech information is delivered naturally, where the packaging rate, φ*, is the syllabic rate of the stimulus (≈ 9 syllables/s), in correspondence with the upper limit of theta, θ_max = 9 Hz. The duration of a φ* cycle (the packet presentation duration) is Δ* = 1/φ* = 110 ms, and the average natural-packet duration is δ* = Δ* = 110 ms. (B) A uniform compression with κ = 4, which results in a deterioration in performance, is followed by repackaging to restore performance. Packaging rate is kept at φ* = 9 packets/s, hence Δ* = 110 ms. Packet duration at optimal restoration is the duration of a θ-syllable, time-compressed by κ = 4, i.e., δ° = 330/4 = 82.5 ms. Entries in the remaining rows are derived in an analogous manner. Note that in rows (B-D) packets are delivered with an identical packaging rate, and the articulated speech information (in terms of time-frequency signature) carried by a particular packet in rows (C, D) is the same as in the corresponding packet in row (B), although with a different acoustic realization (due to the different compression factor).



θ-syllable [Figure 7, row (A)]. Hence, the information transfer rate, in units of θ-syllables/s, is:

\[ R^{*} = 9\ \theta\text{-syllables/s} \]

Since the corresponding ASI_τ is π_τ* = 330 ms_π, and the duration of a natural packet is δ* = Δ* = 1/φ* = 110 ms, the information transfer rate in units of ms_π/s is:

\[ R^{*} = \frac{\pi_\tau^{*}}{\delta^{*}} = \frac{330}{110} = 3\ \text{ms}_\pi/\text{ms} \]

In Experiment II we found that for all κ > 3, with a packaging rate of φ° = φ* = 9 packets/s, at the knee-point of intelligibility recovery a packet carries the ASI of one θ-syllable long speech fragment. Hence, the information transfer rate in units of θ-syllables/s is:

\[ R^{\circ} = 9\ \theta\text{-syllables/s} = R^{*}, \quad \forall\kappa > 3 \]

The packet duration equals the duration of the θ-syllable compressed by κ [Figure 7, rows (B-D)], and the corresponding ASI_τ, π_τ° = 330 ms_π, is delivered within a packet presentation duration of Δ° = Δ* = 1/φ* = 110 ms. Therefore, the information transfer rate, in units of ms_π/s, is:

\[ R^{\circ} = \frac{\pi_\tau^{\circ}}{\Delta^{*}} = \frac{330}{110} = 3\ \text{ms}_\pi/\text{ms} = R^{*}, \quad \forall\kappa > 3 \]

Finally, in Experiment III we found that performance deteriorates for all R > R° or π_τ > π_τ° tested.



Based on these findings we conclude: 

1. The auditory channel can reliably transmit, at most, the ASI in one θ-syllable long speech fragment per one θ_max cycle, independent of κ.

2. R° is the auditory channel capacity, C_auditory. This is so because all other combinations of [packaging-rate] × [packet-duration] with higher bit rates result in higher error rates. Expressed in θ-syllables/s, C_auditory = 9 θ-syllables/s.

3. C_auditory is determined by cortical θ. This is so because for all κ, at capacity, the maximum information reliably decoded is the ASI of one θ-syllable long speech fragment, delivered in κ-compressed θ-syllable long packets at a rate of φ = 9 packets/s = cortical θ_max.

5.1. RELATION TO OSCILLATION-BASED MODELS 

In accordance with our definition (see section "Psychophysical measurement of auditory channel capacity"), the auditory channel includes all pre-lexical layers (including Tempo), with acoustic waveforms as input and θ-syllable objects as output. Reiterating the cortical computation principle embodied in Tempo, the speech decoding process is performed within a hierarchical window structure synchronized with the input, generated by a cascade of oscillations capable of tracking the input pseudo-rhythm. Performance remains high as long as theta, the master, is in sync with the input, and sharply deteriorates once theta is out of sync.

Examining the findings of our study through the prism of Tempo, for time-compressed speech with κ ≤ 3 and without repackaging, the syllabic rate is within the theta range. Synchronization is thus maintained and theta cycles are aligned with intervocalic acoustic segments (i.e., θ-syllables). For κ > 3 performance sharply deteriorates because the syllabic rate (now greater than 9 syllables/s) is outside the range of theta ⇒ theta is out of sync. Repackaging restores intelligibility. A revealing finding is that, at capacity, with a packaging rate of 9 packets/s (and synchronization now maintained), a packet contains the information in a speech fragment that is one uncompressed θ-syllable long, independent of κ (the duration of the packet equals one κ-compressed θ-syllable).

5.2. SYNTHESIS BY REPACKAGING: ACOUSTICS vs. INTELLIGIBILITY 

There is a distinction between the speech information carried by a stimulus and the speech information reliably perceived by the listener. The repackaged stimuli are assumed to contain all speech information articulated by the speaker (i.e., intended to be conveyed). (This assumption is based upon objective criteria, e.g., the ability to recover the uncompressed signal from the repackaged version.) During the human decoding process, however, some of this information is lost, and the extent of loss is quantified by measuring intelligibility. In this study, stimuli were defined by the repackaging parameters κ, φ, and δ, and capacity was defined as the knee-point of intelligibility recovery. What are the auditory functions responsible for the intelligibility loss when listening to repackaged stimuli, and how do the synthesis parameters (which define the stimulus) and the auditory channel parameters interact? We shall use the Tempo model to examine this interaction.

According to Tempo, as long as φ is inside the cortical θ frequency range, the window structure is determined by φ (Ghitza, 2011): cortical θ is in sync with φ, and as the master in the cascaded oscillators array it determines β and γ (via cascading). The β cycles (entrained to θ) define the windows within which the phonetic content is decoded, and the decoding is via sampling the sensory information inside the β cycle at a γ pace (entrained to β); the sampling time-instances are in phase with the β cycle (see Appendix in Ghitza, 2011).

Two cases of stimulus vs. auditory parameter interaction are examined. First, as described in the "Stimulus preparation" subsection of Experiment I, the uniform time compression is in the PSOLA sense; i.e., only the vocal-tract movement is sped up while the pitch contour remains unchanged. If the packet duration of a repackaged stimulus (δ) is smaller than one pitch period, the pitch contour is severely distorted, resulting in a deterioration in intelligibility. For all stimuli used in our study, a packet lasted a few pitch periods (see, for example, Figure 7).

Second, the accuracy of decoding depends on the interaction between the stimulus parameters κ and δ, and the auditory parameter γ. In particular, if the duty cycle of the repackaged stimulus is too small (i.e., if δ is too short compared to the φ cycle), the γ-driven sampling may be too coarse (recall that γ is dictated by φ, via cascading). Undersampling will also occur if the signal inside the packet is overly compressed (κ is too large). These examples illustrate that, for a given φ, intelligibility is affected by the choice of κ and δ. Interestingly, our study shows that for all five repackaging conditions tested (i.e., κ ∈ {4, 5, 6, 7, 8}, all with φ = 9 Hz), capacity is reached for a δ that is a κ-compressed θ-syllable long speech fragment. The fact that, at capacity, both φ and δ correspond to cortical θ leads to the inference that auditory channel capacity is determined by cortical θ.

5.3. HOW GENERALIZABLE ARE THESE FINDINGS? 

Our estimate of auditory channel capacity, C_auditory, was measured for English digit strings spoken by a male talker speaking at a "nominal" rate. Will this estimate generalize to digit strings spoken by a "fast" talker? To English speech corpora with higher perplexity? To speech corpora in other languages?

In Shannon's framework, capacity is determined by the channel (Shannon, 1948). Note that the auditory channel as we define it (see section "Psychophysical measurement of auditory channel capacity") is a time-varying channel: because it operates within a window structure synchronized with the input rhythm, the auditory channel is a function of the input, hence time-dependent. Nevertheless, at capacity the channel can be assumed stationary because the window structure is frozen as the master window is determined by θ_max. With this observation in mind, we suggest the following predictions:

1. A 7-digit strings corpus spoken by "fast" talkers. At capacity, the packaging rate is φ* = 9 packets/s, interpreted to be determined by θ_max = 9 Hz. If we assume the same θ_max across gender and race (indeed species; e.g., Buzsaki et al., 2013), in a repeat of Experiment I, κ at the knee-point of performance (κ*_fast) should be such that φ* = θ_max, with π*, π_τ* and R* as measured for the male talker. Since the syllabic rate for a fast talker is higher than the syllabic rate of a male talker, we expect κ*_fast < κ* = 3. In a repeat of Experiment II (now κ > κ*_fast) the search for optimal recovery of intelligibility should yield δ°, π°, π_τ° and R° as measured for the male talker (as dictated by θ_max). We therefore predict that C_auditory, estimated for 7-digit strings spoken by a male talker, will generalize, in θ-syllables/s, bits/s or ms_π/s units.

2. English speech corpora with higher perplexity. Using a rationale similar to the one used for fast talkers, in a repeat of Experiment I, κ at the knee-point of performance should be such that φ* = θ_max, with a distribution of compressed θ-syllable durations similar to that of a compressed English digit-string source. However, the average ASI (in bits) carried by a θ-syllable in a corpus with a higher perplexity would be greater than that of the English digit-string corpus (because of the richer VΣV inventory). It is therefore predicted that, expressed in θ-syllables/s, capacity will generalize (to be C_auditory = 9 θ-syllables/s); however, if expressed in bits/s, the auditory channel capacity for English speech corpora will be greater than that for a 7-digit strings corpus (with lower perplexity). Measuring capacity in ms_π/s units is inapplicable here because the relationship at the core of the ASI_τ definition, i.e., {ASI_τ, in ms_π} ~ {ASI, in bits}, is no longer valid.

3. Other languages. It has long been noticed that, across languages, syllabic information density (i.e., the average information carried by a syllabic unit, in bits/syllabic-unit) and speech rate (in syllabic-units/s) are strongly negatively correlated. Consequently, a language that carries less information per syllabic unit will "pack" more units per second, e.g., Spanish vs. German (e.g., Pellegrino et al., 2011). How do these source properties across languages, measured at nominal rates (i.e., below capacity), co-exist with our estimate of auditory channel capacity? Following the rationale used before, we predict that in a repeat of Experiment I, κ at the knee-point of performance (κ*) will be such that φ* = θ_max, with a distribution of compressed θ-syllable durations similar across languages, but with a language-dependent average ASI (in bits). As such, κ* should be a function of language, with lower values for languages with a higher speech rate, e.g., κ*_Spanish < κ*_German. A corollary to this prediction is that our estimate of auditory channel capacity, expressed in θ-syllables/s, will generalize (to be C_auditory = 9 θ-syllables/s); however, if expressed in bits/s, the auditory channel capacity for German will be greater than that for Spanish.
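The three predictions share one piece of arithmetic, sketched below under stated assumptions: the knee-point compression factor is taken to be the factor that brings a source's nominal syllabic rate up to θ_max, and capacity in bits/s is taken to be θ_max times the average ASI per θ-syllable. The function names are illustrative, and the two sources and their numbers are hypothetical placeholders, not measurements from this or any other study.

```python
THETA_MAX_HZ = 9.0  # upper limit of cortical theta, as interpreted in the text


def knee_point_kappa(nominal_rate_hz):
    """Compression factor at which the compressed syllabic rate reaches theta_max."""
    return THETA_MAX_HZ / nominal_rate_hz


def capacity_bits_per_s(bits_per_theta_syllable):
    """Capacity in bits/s if capacity is fixed at theta_max theta-syllables/s."""
    return THETA_MAX_HZ * bits_per_theta_syllable


# Hypothetical placeholder sources (NOT measured values).
sources = {
    "slower, denser source": {"rate_hz": 3.0, "bits": 2.0},
    "faster, lighter source": {"rate_hz": 4.5, "bits": 1.5},
}

for name, s in sources.items():
    print(name, knee_point_kappa(s["rate_hz"]), capacity_bits_per_s(s["bits"]))
```

Under these assumptions the faster source reaches its knee-point at a lower κ, and the two sources share the same capacity in θ-syllables/s while differing in bits/s, which is the pattern the predictions describe.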

It is worth emphasizing that our estimate of auditory channel capacity is only valid for young listeners with normal hearing (the age group of our subjects). There is a large variability in how listeners in different age groups perceive time-compressed speech, stemming from either (1) an underlying individual variability in the range of cortical θ, or (2) other deficiencies of neuronal processing at play when listening to time-compressed speech. As for the first possibility, it may be that, for older adults, the frequency range of neuronal oscillations shifts downward. Therefore, a lower θ_max (compared to the young) may result in a reduction in auditory channel capacity. As for the second possibility, some deficiencies were discussed in the previous subsection, "Synthesis by repackaging: acoustics vs. intelligibility."

5.4. CAPACITY: AUDITORY CHANNEL vs. IMMEDIATE MEMORY 

Our way of partitioning the auditory system is shown in Figure 1B. Oscillation-based models exist for both components of the system (the auditory channel and the cortical receiver), with theta oscillations at their core. As is reiterated throughout the paper, the auditory channel contains oscillation-based functions (e.g., as in Tempo) with theta as master. Immediate memory circuitry, for words, belongs to the cortical receiver (with the lexical-access circuitry as the first layer, with pre-lexical units as input and words as output). Recent oscillation-based models of memory circuitry suggest that encoding and retrieval of episodic memory take place at different phases of theta (e.g., Hasselmo et al., 2002, 2009). Other models (e.g., Lisman and Idiart, 1995; Jensen and Lisman, 1996, 2005) propose neuronal networks with theta cycles at the core, subdivided into seven gamma subcycles. These networks form a short-term memory buffer that can actively maintain about seven memories, in correspondence with the capacity of human immediate memory (e.g., Miller, 1956). Do the findings of our study (that the auditory channel capacity is determined by cortical theta) reflect channel limitations or the limitations imposed by immediate memory circuitry?

Within the information-theory framework, channel capacity is defined as the maximum information rate, in units of encoder-symbols/s, that satisfies flawless performance measured at the (error-free) decoder. Auditory channel capacity, in particular, is defined as the maximum information rate, in θ-syllables/s, at the knee-point of performance measured at the cortical receiver in the word accuracy sense. Thus, the auditory channel output is a sequence of pre-lexical units while the receiver operates on words. We assume an error-free receiver because the behavioral task is a digit-string recognition with a memory load of 4 digits: such a memory load is less than the immediate memory span, and the duration of 4 digits is less than the memory decay time (≈2 s, e.g., Cowan, 1984). The assumption of an error-free cortical receiver implies that (1) errors are the result of erroneous pre-lexical units at the channel output (i.e., the errors are induced by the auditory channel), and (2) there are no deficiencies in the immediate memory function (which stores words).

Finally, it is worth noting that, in our view, the theta oscillators in models of the auditory channel are distinct from those in models of memory. Tempo hypothesizes a special class of oscillators, which allow a gradual change in their frequency while tracking the slowly varying input speech pseudo-rhythm. Such a class of theta oscillators is much different from the theta oscillators proposed for memory circuitry, which assume oscillations with a fixed, time-independent frequency.

6. SUMMARY 

Intelligibility of time-compressed 7-digit strings was measured as a function of speech speed and repackaging. Irrespective of speech speed, the maximum information transfer rate through the auditory channel, or auditory channel capacity, is the information in one uncompressed θ-syllable long speech fragment per one θ_max cycle, or 9 θ-syllables/s. Interpreted through the prism of oscillation-based models, the alignment of both the packaging rate and the information per packet with properties of cortical theta implies that the auditory channel capacity is determined by theta. We suggest that, in talker-listener communication, the appropriate unit to express speech information transfer rate is θ-syllables/s. Expressed in θ-syllables/s, auditory channel capacity is constant over articulation speed and corpus perplexity (and languages, in particular), equal to 9 θ-syllables/s. Expressing auditory channel capacity in bits/s will result in source-dependent estimates of capacity.